Imagine you've been commissioned by the City of San Francisco to tackle a problem they've been having with local flora. The parks department has taken extensive documentation of the city's trees since the 1970s - what species are growing, where they are, who they're maintained by - amassing a dataset of over 200K trees in that time.
The funding for that project has recently been called into question, and the City Board needs to see its value before reapproving funds for the following year. Stakeholders have raised several concerns over the past few years, and your job is to use the data to answer them. Good luck!
First things first, let's get some terminology straight.
An .ipynb file is also known as a Jupyter notebook. Jupyter notebooks have a few special properties that make them ideal for working with data:
x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'
print(x) # Run this cell after running the one above, and again after running the one below
Answer to the Ultimate Question of Life, the Universe, and Everything
x = 42
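The cells above hint at how a notebook keeps one shared memory: what print(x) shows depends on which cell ran most recently, not on where the cell sits on the page. Here is a minimal sketch of that idea in plain Python, where each string stands in for one cell and a dictionary stands in for the notebook's memory. The helper run_cell and the name kernel_state are made up for this illustration and are not part of Jupyter.

```python
import contextlib
import io

# A plain dict plays the role of the notebook's shared memory (hypothetical stand-in).
kernel_state = {}

# Each string stands in for one notebook cell.
cell_a = "x = 'Answer to the Ultimate Question of Life, the Universe, and Everything'"
cell_b = "print(x)"
cell_c = "x = 42"


def run_cell(source):
    """Run one 'cell' against the shared state and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, kernel_state)
    return buffer.getvalue().strip()


run_cell(cell_a)
first_output = run_cell(cell_b)   # x still holds the long string
run_cell(cell_c)
second_output = run_cell(cell_b)  # same cell, but x was reassigned in between

print(first_output)
print(second_output)
```

Running the same print cell twice gives two different results because the variable was reassigned in between, which is worth remembering when a notebook's cells have been run out of order.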
Anything you can do in Python, you can do here!
def UltimateQuestion(computer_name):
    return computer_name + ' is thinking...'
UltimateQuestion('Deep Thought')
'Deep Thought is thinking...'
We use the pandas package to easily work with data as tables.
The numpy package allows us to work with some other special data types, such as missing values.
We'll import these under the aliases pd and np, just so they're easier to refer to later on.
import pandas as pd
import numpy as np
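To make "missing values" a bit more concrete, here is a small sketch using np.nan, the marker that shows up as NaN in tables. The numbers are made up for illustration and are not taken from the trees data.

```python
import numpy as np
import pandas as pd

# Hypothetical diameters, with one measurement missing
diameters = pd.Series([16.0, np.nan, 4.0])

is_missing = diameters.isna()        # True where the value is missing
n_missing = int(is_missing.sum())    # True counts as 1 when summed
mean_diameter = diameters.mean()     # pandas skips NaN by default

# NaN is not equal to anything, including itself, so use .isna() instead of ==
nan_equals_itself = (np.nan == np.nan)

print(n_missing, mean_diameter, nan_equals_itself)
```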
For this semester, we'll typically work with data in tabular format, the type you'd be used to in an Excel spreadsheet. Data files saved in this format will usually have a .csv file ending, short for comma-separated values.
For example, a CSV file could look something like...
tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
To import this, let's use the pd.read_csv() function:
url = 'https://raw.githubusercontent.com/ishaandey/node/master/week-1/workshop/trees.csv'
trees = pd.read_csv(url)
Here, we've saved the data to a DataFrame object named trees.
type(trees)
pandas.core.frame.DataFrame
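To see how pd.read_csv() turns text into a table without relying on the network, the three-row example file from earlier can be read from an in-memory string. This is only a sketch: the name mini_trees is made up here, and skipinitialspace=True is used because the example has a space after each comma, which would otherwise end up inside the column names and values.

```python
import io

import pandas as pd

# The small example file from earlier, held in a string instead of on disk
csv_text = """tree_number, species_name, address
312, Magnolia grandiflora, 2828 Divisadero St
124, Melaleuca quinquenervia, 485 Union St
912, Pittosporum undulatum, 47 Vicksburg St
"""

# io.StringIO lets read_csv treat the string like an open file
mini_trees = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(mini_trees.columns.tolist())
print(mini_trees.shape)
```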
DataFrames hold our data in spreadsheet-like structures. Whatever manipulation you can think of doing to the data, you can likely find out how to do it with a quick search.
Let's take a look at the data. We'll use the .head() method to display the first 5 rows.
trees.head()
| | tree_id | legal_status | caretaker | dbh | plot_size | species_name | common_name | date | site_location | site_type | latitude | longitude | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30314 | DPW Maintained | Private | 16.0 | NaN | Pittosporum undulatum | Victorian Box | 1955-10-20 | Sidewalk | Cutout | 37.759772 | -122.398109 | 501 Arkansas St |
| 1 | 30321 | DPW Maintained | Private | 2.0 | NaN | Magnolia grandiflora | Southern Magnolia | 1956-01-06 | Sidewalk | Cutout | 37.795718 | -122.441860 | 2828 Divisadero St |
| 2 | 30334 | DPW Maintained | Private | 4.0 | NaN | Ginkgo biloba | Maidenhair Tree | 1956-02-06 | Sidewalk | Cutout | 37.743222 | -122.433634 | 601 29th St |
| 3 | 30335 | DPW Maintained | Private | 2.0 | NaN | Ginkgo biloba | Maidenhair Tree | 1956-02-06 | Sidewalk | Cutout | 37.743226 | -122.433565 | 601 29th St |
| 4 | 30333 | DPW Maintained | Private | 1.0 | NaN | Arbutus 'Marina' | Hybrid Strawberry Tree | 1956-02-06 | Sidewalk | Cutout | 37.743217 | -122.433721 | 601 29th St |
How big is the dataset? .shape returns a tuple with the dimensions as (rows, columns)
trees.shape
(36073, 13)
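As a quick sketch of how .head() and .shape behave, here is the same pair of calls on a tiny made-up table. The values are illustrative only.

```python
import pandas as pd

# Hypothetical mini table with 7 rows and 2 columns
demo = pd.DataFrame({
    'tree_id': [1, 2, 3, 4, 5, 6, 7],
    'dbh': [16.0, 2.0, 4.0, 2.0, 1.0, 11.0, 12.0],
})

n_rows, n_cols = demo.shape   # .shape is an attribute, so no parentheses
first_five = demo.head()      # 5 rows by default
first_two = demo.head(2)      # pass a number to change that

print(n_rows, n_cols)
print(len(first_five), len(first_two))
```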
Let's try to understand our data a bit better.
trees.species_name.nunique()
367
trees.common_name.value_counts()
Swamp Myrtle 2781
Brisbane Box 2751
Hybrid Strawberry Tree 1968
Victorian Box 1604
Southern Magnolia 1602
...
Peppermint Box 1
Yucca 1
Flannel Bush Tree 1
Burgundy Sweet Gum 1
Birchbark Cherry 1
Name: common_name, Length: 365, dtype: int64
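The two calls above answer slightly different questions: .nunique() gives the number of distinct values, while .value_counts() gives how often each one appears, most frequent first. A small sketch on made-up names, including one missing entry to show that both skip missing values by default:

```python
import numpy as np
import pandas as pd

# Hypothetical column of common names with one missing entry
names = pd.Series(['Swamp Myrtle', 'Brisbane Box', 'Swamp Myrtle',
                   np.nan, 'Yucca', 'Swamp Myrtle'])

n_distinct = names.nunique()                             # missing value not counted
counts = names.value_counts()                            # sorted, most frequent first
counts_with_missing = names.value_counts(dropna=False)   # also counts the missing entry

print(n_distinct)
print(counts)
```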
Show the biggest trees by sorting the dataframe:
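Note: dbh stands for diameter at breast height, the standard forestry measure of trunk diameter (taken at roughly chest height rather than at the base)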
trees.sort_values(by='dbh', ascending=False)
| | tree_id | legal_status | caretaker | dbh | plot_size | species_name | common_name | date | site_location | site_type | latitude | longitude | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34738 | 14513 | DPW Maintained | DPW | 100.0 | 4X4 | Fraxinus uhdei | Shamel Ash: Evergreen Ash | 2018-06-18 | Sidewalk | Cutout | 37.776560 | -122.446728 | 501 Masonic Ave |
| 28183 | 12738 | DPW Maintained | DPW | 100.0 | 4x4 | Tristaniopsis laurina 'Elegant' | Small-leaf Tristania 'Elegant' | 2013-07-12 | Sidewalk | Cutout | 37.786183 | -122.477196 | 1630 Lake St |
| 5025 | 4768 | DPW Maintained | DPW | 100.0 | 3X3 | Corymbia ficifolia | Red Flowering Gum | 1993-01-05 | Sidewalk | Cutout | 37.732715 | -122.385231 | 26 Commer Ct |
| 17964 | 24961 | DPW Maintained | DPW | 90.0 | 20 | Phoenix canariensis | Canary Island Date Palm | 2005-04-21 | Median | Cutout | 37.767709 | -122.426675 | 100 Dolores St |
| 5581 | 13104 | DPW Maintained | DPW | 90.0 | 3X3 | Ficus retusa nitida | Banyan Fig | 1993-10-26 | Sidewalk | Cutout | 37.801143 | -122.426724 | 1530 Lombard St |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14101 | 78518 | DPW Maintained | Private | 0.0 | NaN | Prunus cerasifera | Cherry Plum | 2000-10-21 | Sidewalk | Cutout | 37.710295 | -122.450931 | 75 Laura St |
| 14114 | 78567 | DPW Maintained | Private | 0.0 | NaN | Arbutus 'Marina' | Hybrid Strawberry Tree | 2000-10-21 | Sidewalk | Cutout | 37.710306 | -122.453138 | 40 Sears St |
| 14763 | 44728 | DPW Maintained | Private | 0.0 | NaN | Melaleuca quinquenervia | Cajeput | 2001-04-03 | Sidewalk | Cutout | 37.748648 | -122.477643 | 1144 Quintara St |
| 14796 | 44797 | DPW Maintained | Private | 0.0 | NaN | Prunus serrulata | Ornamental Cherry | 2001-04-12 | Sidewalk | Cutout | 37.765145 | -122.480368 | 1206 22nd Ave |
| 36072 | 144192 | DPW Maintained | Private | 0.0 | Width 4ft | Lophostemon confertus | Brisbane Box | 2020-01-25 | Sidewalk | Cutout | 37.776940 | -122.502697 | 618 42nd Ave |
36073 rows × 13 columns
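Two details of sorting are easy to miss: sort_values() returns a new table and leaves the original alone, and each row keeps its original index label (which is why the index column above is no longer in order). A sketch on made-up values, with nlargest() shown as one possible shortcut when only the top few rows are needed:

```python
import pandas as pd

# Hypothetical mini table
demo = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Banyan Fig', 'Yucca', 'Blue Gum'],
    'dbh': [3.0, 90.0, 23.0, 15.0],
})

# Largest first; the result is a new DataFrame
sorted_demo = demo.sort_values(by='dbh', ascending=False)

# Shortcut for "the 2 rows with the largest dbh"
top_two = demo.nlargest(2, 'dbh')

print(sorted_demo)
print(top_two)
```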
Subsetting is a super helpful tool. We'll take a look at this in more depth next week, but for now, here are the basics:
We can filter rows from a dataframe based on some condition
# Cherry Plum trees
trees[trees.common_name == 'Cherry Plum']
| | tree_id | legal_status | caretaker | dbh | plot_size | species_name | common_name | date | site_location | site_type | latitude | longitude | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 149 | 53700 | Permitted Site | Private | 14.0 | NaN | Prunus cerasifera | Cherry Plum | 1970-03-04 | Sidewalk | Cutout | 37.746081 | -122.426025 | 263 Duncan St |
| 198 | 54020 | DPW Maintained | Private | 13.0 | NaN | Prunus cerasifera | Cherry Plum | 1972-04-07 | Sidewalk | Cutout | 37.772780 | -122.494875 | 862 35th Ave |
| 208 | 54057 | DPW Maintained | Private | 8.0 | NaN | Prunus cerasifera | Cherry Plum | 1972-04-21 | Sidewalk | Cutout | 37.772551 | -122.494860 | 874 35th Ave |
| 265 | 54255 | Permitted Site | Private | 10.0 | 3x3 | Prunus cerasifera | Cherry Plum | 1972-07-03 | Sidewalk | Cutout | 37.759509 | -122.442802 | 191 Caselli Ave |
| 364 | 221734 | DPW Maintained | Private | 12.0 | Width 4ft | Prunus cerasifera | Cherry Plum | 1972-08-17 | Sidewalk | Cutout | 37.765292 | -122.452934 | 203 Carl St |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35535 | 55973 | DPW Maintained | Private | 3.0 | NaN | Prunus cerasifera | Cherry Plum | 2019-06-10 | Sidewalk | Cutout | 37.791259 | -122.432719 | 2221 Webster St |
| 35571 | 236272 | DPW Maintained | Private | 3.0 | Width 3ft | Prunus cerasifera | Cherry Plum | 2019-07-26 | Sidewalk | Cutout | 37.766989 | -122.416495 | 99 Shotwell St |
| 35572 | 236271 | DPW Maintained | Private | 3.0 | Width 3ft | Prunus cerasifera | Cherry Plum | 2019-07-26 | Sidewalk | Cutout | 37.767032 | -122.416501 | 99 Shotwell St |
| 35700 | 246210 | DPW Maintained | Private | 3.0 | Width 0ft | Prunus cerasifera | Cherry Plum | 2019-10-01 | Sidewalk | Cutout | 37.767967 | -122.443800 | 725 Buena Vista Ave West |
| 35701 | 246211 | DPW Maintained | Private | 3.0 | Width 0ft | Prunus cerasifera | Cherry Plum | 2019-10-01 | Sidewalk | Cutout | 37.767917 | -122.443821 | 725 Buena Vista Ave West |
1180 rows × 13 columns
How would you show only trees north of Golden Gate Park (latitude > 37.77285)?
Hint: Write the condition the same way you would in a Python if statement, mirroring the syntax above
trees[trees.latitude > 37.77285]
| | tree_id | legal_status | caretaker | dbh | plot_size | species_name | common_name | date | site_location | site_type | latitude | longitude | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 30321 | DPW Maintained | Private | 2.0 | NaN | Magnolia grandiflora | Southern Magnolia | 1956-01-06 | Sidewalk | Cutout | 37.795718 | -122.441860 | 2828 Divisadero St |
| 5 | 30339 | DPW Maintained | Private | 11.0 | NaN | Platanus x hispanica | Sycamore: London Plane | 1956-02-15 | Sidewalk | Cutout | 37.793189 | -122.441380 | 2560 Divisadero St |
| 6 | 30337 | DPW Maintained | Private | 12.0 | NaN | Platanus x hispanica | Sycamore: London Plane | 1956-02-15 | Sidewalk | Cutout | 37.793242 | -122.441395 | 2560 Divisadero St |
| 7 | 30341 | DPW Maintained | Private | 10.0 | NaN | Acacia melanoxylon | Blackwood Acacia | 1956-02-15 | Sidewalk | Cutout | 37.805913 | -122.437521 | 3789 Fillmore St |
| 20 | 30418 | DPW Maintained | Private | 12.0 | NaN | Platanus x hispanica | Sycamore: London Plane | 1956-03-26 | Sidewalk | Cutout | 37.797295 | -122.440879 | 2509 Filbert St |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36068 | 144227 | DPW Maintained | Private | 0.0 | Width 4ft | Agonis flexuosa | Peppermint Willow | 2020-01-25 | Sidewalk | Cutout | 37.773933 | -122.503557 | 782 43rd Ave |
| 36069 | 144230 | DPW Maintained | Private | 0.0 | Width 4ft | Melaleuca quinquenervia | Cajeput | 2020-01-25 | Sidewalk | Cutout | 37.775598 | -122.503676 | 696 43rd Ave |
| 36070 | 261517 | DPW Maintained | Private | 3.0 | Width 3ft | Agonis flexuosa | Peppermint Willow | 2020-01-25 | Sidewalk | Yard | 37.775886 | -122.501730 | 679 41st Ave |
| 36071 | 144157 | DPW Maintained | Private | 0.0 | Width 4ft | Tristaniopsis laurina | Swamp Myrtle | 2020-01-25 | Sidewalk | Cutout | 37.774642 | -122.501452 | 746 41st Ave |
| 36072 | 144192 | DPW Maintained | Private | 0.0 | Width 4ft | Lophostemon confertus | Brisbane Box | 2020-01-25 | Sidewalk | Cutout | 37.776940 | -122.502697 | 618 42nd Ave |
15811 rows × 13 columns
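Under the hood, a condition like trees.latitude > 37.77285 produces a column of True/False values (often called a mask), and putting that mask inside the square brackets keeps only the True rows. A sketch on made-up rows, which also shows how two conditions might be combined with & (and) or | (or):

```python
import pandas as pd

# Hypothetical mini table
demo = pd.DataFrame({
    'common_name': ['Cherry Plum', 'Cherry Plum', 'Maidenhair Tree', 'Yucca'],
    'latitude': [37.746, 37.791, 37.743, 37.780],
})

is_cherry_plum = demo.common_name == 'Cherry Plum'   # one True/False per row
is_north = demo.latitude > 37.77285

north_cherry_plums = demo[is_cherry_plum & is_north]   # both conditions hold
cherry_or_north = demo[is_cherry_plum | is_north]      # at least one holds

print(is_north.tolist())
print(north_cherry_plums)
```

When the conditions are written inline rather than saved to variables first, each one needs its own parentheses, as in demo[(demo.common_name == 'Cherry Plum') & (demo.latitude > 37.77285)].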
What is the average diameter of the Evergreen Pear tree?
trees[trees.common_name == 'Evergreen Pear'].dbh.mean()
5.306595365418895
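# Select dbh before averaging: newer pandas versions raise an error when asked for the mean of text columns
trees.groupby(by='common_name')['dbh'].mean().sort_values(ascending=False).head(20)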
common_name
Date palm (species unknown) 70.000000
False Avocado 35.000000
Canary Island Date Palm 30.912664
Flooded Box: Coolibah 30.000000
Morton Bay Fig 29.000000
Douglas Fir 26.333333
Moraine Ash 26.000000
Burgundy Sweet Gum 24.000000
Yucca 23.000000
Beefwood: Drooping She-Oak 22.666667
Bloodgood London Plane 21.750000
Norfolk Island Pine 20.333333
Shamel Ash: Evergreen Ash 20.294118
Poplar Spp 18.000000
Nichol's Willow-Leafed Peppermint 17.387097
Silver Mountain Gum Tree 17.000000
Silk Oak Tree 'Red Hooks' 16.200000
Siberian Elm 16.105263
Lombardy Poplar 16.000000
Blue Gum 15.250000
Name: dbh, dtype: float64
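The groupby call above splits the table into one group per common name and averages dbh within each group. An average can look impressive when it rests on very few trees, so it may be worth reporting the group size next to it. A sketch on made-up values (the names and numbers are illustrative, not results from the trees data):

```python
import pandas as pd

# Hypothetical mini table: one very large tree and a few ordinary ones
demo = pd.DataFrame({
    'common_name': ['Date Palm', 'Cherry Plum', 'Cherry Plum',
                    'Cherry Plum', 'Blue Gum', 'Blue Gum'],
    'dbh': [70.0, 3.0, 8.0, 13.0, 10.0, 20.0],
})

# One row per common name, with the average diameter and the number of trees behind it
summary = (
    demo.groupby('common_name')['dbh']
        .agg(['mean', 'count'])
        .sort_values('mean', ascending=False)
)

print(summary)
```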
First things first, let's import the package to help us visualize the data, plotly.
If this package isn't installed yet, we can install it using !pip install plotly. More on this in week 5.
import plotly.express as px
Note that we're using a subpackage of the broader plotly package, called Plotly Express. This simplifies a lot of the more difficult steps.
Plotly Express has a broad range of options to play with, so let's take a look at the documentation.
Do a quick Google search to pull up the documentation for px.scatter, or run px.scatter? in a Jupyter cell
px.scatter?
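The plots that follow use trees.sample(frac=.2), which draws a random 20% of the rows so the figures stay responsive. Each run draws a different sample, so the plots will differ slightly from run to run. If repeatable figures are wanted, one option is to pass random_state, as in this sketch on a made-up table (the seed 42 is an arbitrary choice):

```python
import pandas as pd

# Hypothetical table with 100 rows
demo = pd.DataFrame({'tree_id': range(100)})

# Same seed, same rows, every time
sample_a = demo.sample(frac=.2, random_state=42)
sample_b = demo.sample(frac=.2, random_state=42)

print(len(sample_a))
print(sample_a.equals(sample_b))
```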
trees_sample = trees.sample(frac=.2)
fig = px.scatter(trees_sample, x='date', y='dbh')
fig.show()
There aren't any obvious trends from this view. Let's add some more parameters:
fig = px.scatter(trees_sample, x='date', y='dbh',
                 opacity=.15, color='site_location',
                 hover_name='common_name', hover_data=['site_location','site_type','address'],
                 marginal_x='histogram', marginal_y='histogram',
                 color_discrete_sequence=px.colors.qualitative.Prism[4:],
                 labels={'site_location':'Site Location', 'dbh':'Tree Diameter', 'date':'Date Recorded'}
                )
fig.show()
The transportation department wants to track any trees sitting on a road median, in order to quickly remove debris after a bad storm.
# Note: the original used mapbox_style="stamen-terrain"; Stamen tiles have since moved to
# Stadia Maps and may need an API key, so a token-free style is used here instead
fig = px.scatter_mapbox(trees_sample, lat='latitude', lon='longitude', mapbox_style="open-street-map", zoom=11,
                        color='site_location', size='dbh', opacity=.3,
                        color_discrete_sequence=['orange','red','orange','orange','orange','orange'],
                        hover_name='address', hover_data=['site_location','caretaker'],
                        labels={'site_location':'Site Location', 'dbh':'Tree Diameter',
                                'date':'Date Recorded', 'caretaker':'Caretaker'}
                       )
fig.show()
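The map above appears to highlight median trees by placing 'red' second in color_discrete_sequence, which only works if 'Median' happens to be the second category Plotly encounters. A more explicit alternative, sketched here on made-up rows, is to build a dictionary from category to color. Plotly Express accepts such a dictionary through its color_discrete_map parameter. The category 'Front Yard' is a hypothetical stand-in; 'Median' and 'Sidewalk' are spelled as they appear in the table output earlier.

```python
import pandas as pd

# Hypothetical mini table of site locations
demo = pd.DataFrame({
    'site_location': ['Sidewalk', 'Median', 'Sidewalk', 'Front Yard', 'Median'],
})

# Red for medians, orange for everything else, regardless of category order
color_map = {
    location: ('red' if location == 'Median' else 'orange')
    for location in demo['site_location'].unique()
}

# The same condition also gives a plain table of just the median trees
medians_only = demo[demo['site_location'] == 'Median']

print(color_map)
print(len(medians_only))
```

The dictionary could then be passed to the map call as color_discrete_map=color_map in place of the color_discrete_sequence list.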